System Design Refresher

1. Networking & Communication

IP Addressing
DNS (Domain Name System)
Load Balancers
- L4 vs L7
Proxy
- Forward Proxy
- Reverse Proxy
TCP vs UDP
HTTP/HTTPS Basics
- HTTP Methods
- Important Headers
- HTTP Status Codes
- REST vs GraphQL vs gRPC
Real-Time Communication
- WebSockets
- Long Polling
- Server-Sent Events (SSE)
CDN (Content Delivery Network)

2. Storage & Databases

Relational Databases (SQL)
- ACID Properties
- Isolation Levels
- Normalization
Database Replication Patterns
- Master-Slave (Primary-Replica)
- Master-Master (Multi-Master)
NoSQL Databases
- Key-Value Stores
- Document Databases
- Wide-Column Stores
- Graph Databases
Database Indexes
- B-Tree Indexes
- Hash Indexes
- Inverted Indexes
- Geospatial Indexes
Replication & Sharding
- Horizontal Scaling (Sharding)
- Vertical Scaling
CAP Theorem
Query Optimization
CDC (Change Data Capture)
Full-Text Search
Caching Strategies
- Cache Levels
- Cache Patterns
- Cache Eviction Policies
- Cache Stampede

3. Scalability & Reliability

Load Balancing Algorithms
- Round Robin
- Least Connections
- Consistent Hashing
- IP Hash
- Least Response Time
Rate Limiting & Throttling
- Algorithms (Token Bucket, Leaky Bucket, etc.)
Message Queues & Streams
- Kafka
- RabbitMQ
- AWS SQS
- Backpressure
- Consumer Groups
Leader Election
- Raft
- Paxos
- ZooKeeper
Failover & Redundancy
- Active-Passive
- Active-Active

4. System Design Patterns

Read vs Write-Heavy Systems
CQRS (Command Query Responsibility Segregation)
Event Sourcing
Caching Patterns (Revisited)
- Write-Through
- Write-Back
- Write-Around
Idempotency & Retries
Consistency Models
- Strong Consistency
- Eventual Consistency
- Causal Consistency
- Read-Your-Writes Consistency
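- Monotonic Reads

5. Advanced Caching

CDN Deep Dive
Edge Computing
Redis Advanced Patterns
- Redis as Message Broker
- Redis as Database
- Redis Data Structures
Application-Level Caching
- In-Memory Caching
- Distributed Caching
- Multi-Level Caching
Cache Invalidation Strategies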

6. Observability

Backoff, Jitter, and Retry Strategies
- Exponential Backoff
- Jitter
- Circuit Breaker Pattern
Logging Best Practices
- Structured Logging
- Log Levels
- Correlation IDs
Monitoring & Metrics
- Key Metrics (RED Method)
- USE Method
- Golden Signals
- Metric Types
- Prometheus & Grafana
Alerting Best Practices
Distributed Tracing
- Jaeger
- OpenTelemetry
SLO, SLI, SLA
- Error Budgets

7. Security & Privacy

Authentication vs Authorization
Authentication Mechanisms
- JWT
- OAuth 2.0
- SSO
- Session-Based Authentication
Encryption
- TLS/HTTPS
- Data at Rest
- Hashing vs Encryption
DDoS Protection & WAF
Data Privacy (GDPR Basics)

8. Infrastructure & Deployment

Containers & Orchestration
- Docker
- Kubernetes
CI/CD Pipelines
- Continuous Integration
- Continuous Deployment
- Deployment Strategies
Service Discovery
API Gateway
Microservices vs Monoliths

9. Special Topics

Search Systems
- Inverted Index
- Ranking Algorithms
- Elasticsearch Architecture
Bloom Filters
Recommendation Systems
- Collaborative Filtering
- Content-Based Filtering
Distributed Transactions
- Two-Phase Commit (2PC)
- Saga Pattern
Consensus Algorithms
- Raft
- Paxos
Time & Ordering in Distributed Systems
- Lamport Clocks
- Vector Clocks
- True Time

1. Networking & Communication

IP Addressing

IPv4: 32-bit addresses (e.g., 192.168.1.1), supports ~4.3 billion addresses
IPv6: 128-bit addresses, designed to solve IPv4 exhaustion
Private vs Public IPs: Private ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) for internal networks, public addresses for internet-facing hosts
CIDR Notation: 192.168.1.0/24 means first 24 bits are network, last 8 for hosts
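
The CIDR arithmetic above can be checked with Python's standard ipaddress module. A minimal sketch; the sample addresses are illustrative only:

```python
import ipaddress

# 192.168.1.0/24: the first 24 bits identify the network, the last 8 identify hosts.
network = ipaddress.ip_network("192.168.1.0/24")

total_addresses = network.num_addresses      # 2 ** (32 - 24)
netmask = str(network.netmask)               # dotted-quad form of the /24 prefix
contains_host = ipaddress.ip_address("192.168.1.1") in network

# is_private reports whether an address falls in a reserved private range.
private_flags = {
    addr: ipaddress.ip_address(addr).is_private
    for addr in ("10.0.0.5", "172.16.0.1", "192.168.1.1", "8.8.8.8")
}
```

Note that num_addresses counts every address in the block, including the network and broadcast addresses, so a /24 has fewer usable host addresses than the raw count.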

DNS (Domain Name System)

Translates domain names to IP addresses
Hierarchical system: Root → TLD (.com) → Authoritative nameserver
Record types:
- A: Maps domain to IPv4
- AAAA: Maps to IPv6
- CNAME: Alias to another domain
- MX: Mail server
- NS: Nameserver
DNS caching: Browsers, OS, recursive resolvers cache results (TTL-based)
Interview tip: DNS is often a single point of failure; use multiple nameservers

Load Balancers

L4 (Layer 4 - Transport Layer)

Operates at TCP/UDP level
Routes based on IP address and port
Faster, less inspection overhead
Cannot route based on content (URL, headers)
Use case: High-throughput, low-latency requirements

L7 (Layer 7 - Application Layer)

Operates at HTTP/HTTPS level
Routes based on URLs, headers, cookies
Can do SSL termination, content-based routing
More CPU intensive
Use case: Microservices with different endpoints, A/B testing

Proxy

Forward Proxy

Client-side proxy
Client → Forward Proxy → Internet
Use cases:
- Content filtering in organizations
- Anonymity (VPN-like behavior)
- Caching for clients
Example: Corporate proxy server

Reverse Proxy

Server-side proxy
Client → Reverse Proxy → Backend servers
Use cases:
- Load balancing
- SSL termination
- Caching
- Security (hide backend infrastructure)
Examples: Nginx, HAProxy

TCP vs UDP

Feature	TCP	UDP
Connection	Connection-oriented (3-way handshake)	Connectionless
Reliability	Guaranteed delivery, ordered	No guarantee, may lose/reorder packets
Speed	Slower (overhead)	Faster
Use cases	HTTP, SSH, File transfers	Video streaming, DNS, Gaming, VoIP
Flow control	Yes (prevents overwhelming receiver)	No
Error checking	Extensive	Basic checksum

Interview insight: TCP trades speed for reliability; UDP trades reliability for speed

HTTP/HTTPS Basics

HTTP Methods

GET: Retrieve resource (idempotent, cacheable)
POST: Create resource (not idempotent)
PUT: Update/replace entire resource (idempotent)
PATCH: Partial update (not necessarily idempotent)
DELETE: Remove resource (idempotent)
HEAD: Like GET but without response body
OPTIONS: Check available methods

Important Headers

Cache-Control: Directives for caching (max-age, no-cache, no-store)
ETag: Resource version identifier for conditional requests
Authorization: Bearer tokens, API keys
Content-Type: MIME type (application/json, text/html)
User-Agent: Client information
Accept: Content types client can process

HTTP Status Codes

2xx: Success (200 OK, 201 Created, 204 No Content)
3xx: Redirection (301 Moved Permanently, 304 Not Modified)
4xx: Client errors (400 Bad Request, 401 Unauthorized, 404 Not Found, 429 Too Many Requests)
5xx: Server errors (500 Internal Server Error, 502 Bad Gateway, 503 Service Unavailable)

REST vs GraphQL vs gRPC

REST

Resource-based URLs (/users/123)
Standard HTTP methods
Over-fetching/under-fetching possible
Easy caching via HTTP
Best for: Public APIs, CRUD operations

GraphQL

Single endpoint (/graphql)
Client specifies exact data needed
Reduces over-fetching
More complex server-side
Best for: Complex data requirements, mobile apps (bandwidth concerns)

gRPC

Uses Protocol Buffers (binary format)
HTTP/2 based, bidirectional streaming
Strongly typed contracts
More efficient than JSON
Best for: Internal microservices, high-performance requirements

Real-Time Communication

WebSockets

Full-duplex communication over single TCP connection
Persistent connection (after HTTP upgrade)
Low latency, real-time bidirectional
Use cases: Chat apps, live trading, multiplayer games
Trade-off: Stateful, harder to scale (each long-lived connection is pinned to one server; fan-out across servers needs sticky routing or a shared pub/sub layer)

Long Polling

Client requests, server holds connection until data available
Then responds, client immediately requests again
More overhead than WebSockets but better compatibility
Use cases: Real-time updates where WebSockets unavailable

Server-Sent Events (SSE)

Server pushes updates to client over HTTP
Unidirectional (server → client)
Auto-reconnects, built-in event IDs
Use cases: News feeds, stock tickers, notifications
Trade-off: Only server-to-client (unlike WebSockets)

CDN (Content Delivery Network)

Distributed servers at edge locations (geographically closer to users)
Benefits:
- Reduced latency (geographic proximity)
- Lower bandwidth costs
- DDoS protection
- High availability
Edge caching: Static assets cached at CDN edges
Geo-replication: Content replicated across regions
Invalidation: Can purge/update cached content
Examples: CloudFlare, Akamai, CloudFront

2. Storage & Databases

Relational Databases (SQL)

ACID Properties

Atomicity: All operations in transaction succeed or all fail (no partial states)
Consistency: Database remains in valid state (constraints honored)
Isolation: Concurrent transactions don't interfere
Durability: Committed data persists even after crashes

Isolation Levels

Read Uncommitted: Can read uncommitted changes (dirty reads)
Read Committed: Only reads committed data (default in many DBs)
Repeatable Read: Rows already read return the same values when re-read within the transaction (phantom rows may still appear in some databases)
Serializable: Strongest isolation, transactions fully isolated

Normalization

1NF: Atomic values, no repeating groups
2NF: 1NF + no partial dependencies (all non-key attributes depend on entire primary key)
3NF: 2NF + no transitive dependencies
Trade-off: More normalized = less redundancy but more joins; denormalization for read performance

Database Replication Patterns

Master-Slave (Primary-Replica)

Write: Goes to master only
Read: Can be served from slaves/replicas
Pros: Scales reads, simple architecture
Cons: Master is single point of failure for writes, replication lag
Use case: Read-heavy workloads (90% reads, 10% writes)

Master-Master (Multi-Master)

Write: Can go to any master
Read: From any master
Pros: No single point of failure for writes, better write scaling
Cons: Complex conflict resolution, potential data inconsistencies
Conflict resolution: Last-write-wins, version vectors, custom logic
Use case: Globally distributed systems, high write availability

NoSQL Databases

Key-Value Stores

Structure: Simple key → value mapping
Examples: Redis, DynamoDB, Riak
Pros: Extremely fast, simple, horizontally scalable
Cons: Limited query capabilities (no joins, no complex queries)
Use cases: Session storage, caching, shopping carts

Document Databases

Structure: Store JSON-like documents
Examples: MongoDB, CouchDB, Firestore
Pros: Flexible schema, good for hierarchical data, can index/query fields
Cons: Limited join support (embed or reference instead), consistency guarantees vary by database and configuration
Use cases: Content management, user profiles, catalogs

Wide-Column Stores

Structure: Column families, rows can have different columns
Examples: Cassandra, HBase, BigTable
Pros: Efficient for sparse data, scales horizontally, fast writes
Cons: Complex data modeling, eventual consistency
Use cases: Time-series data, IoT sensors, event logging

Graph Databases

Structure: Nodes, edges, properties
Examples: Neo4j, JanusGraph, Amazon Neptune
Pros: Efficient for relationship queries, natural for connected data
Cons: Harder to scale horizontally, specialized use cases
Use cases: Social networks, recommendation engines, fraud detection

Database Indexes

B-Tree Indexes

Structure: Balanced tree, sorted data
Pros: Good for range queries, ordered data
Operations: O(log n) for search, insert, delete
Use case: Default index in most SQL databases
Example: WHERE age BETWEEN 25 AND 35

Hash Indexes

Structure: Hash table
Pros: O(1) for exact match lookups
Cons: No range queries, no ordering
Use case: Equality comparisons only
Example: WHERE user_id = 123

Inverted Indexes

Structure: Maps terms to document IDs containing them
Used in: Full-text search engines
Example:
- Doc1: "quick brown fox"
- Doc2: "brown dog"
- Index: "brown" → [Doc1, Doc2]
Use case: Search functionality (Elasticsearch)
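
The Doc1/Doc2 example above can be sketched in a few lines. This assumes naive lowercase whitespace tokenization (no stemming or scoring); the function names build_inverted_index and search are illustrative, not any search engine's API:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():   # naive whitespace tokenization
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, query):
    """Return IDs of documents that contain every term in the query (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)

docs = {"Doc1": "quick brown fox", "Doc2": "brown dog"}
index = build_inverted_index(docs)
brown_docs = index["brown"]              # posting list for "brown"
fox_docs = search(index, "brown fox")    # documents containing both terms
```

A lookup touches only the posting lists of the query terms, which is why this structure scales better than scanning every document.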

Geospatial Indexes

Types:
- R-tree: For spatial data (rectangles, polygons)
- Quadtree: Divides space into quadrants recursively
- Geohash: Encodes lat/long into string
Use cases: "Find restaurants within 5km", location-based services
Examples: MongoDB geospatial, PostGIS

Replication & Sharding

Horizontal Scaling (Sharding)

Definition: Distribute data across multiple machines
Sharding strategies:
- Hash-based: hash(key) % num_shards
- Range-based: user_id 1-1M on shard1, 1M-2M on shard2
- Geography-based: EU users on EU shard, US users on US shard
- Directory-based: Lookup table maps keys to shards
Challenges:
- Cross-shard joins expensive
- Rebalancing shards when adding nodes
- Choosing good shard key (avoid hotspots)
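
Why rebalancing is painful with hash(key) % num_shards: a rough sketch that measures how many keys change shards when going from 4 to 5 shards. SHA-256 is used here only as a stable, evenly distributed hash (an assumption, not a requirement), and the key names are made up:

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Hash-based placement: stable hash of the key, modulo the shard count."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def moved_fraction(keys, old_shards: int, new_shards: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for key in keys
        if shard_for(key, old_shards) != shard_for(key, new_shards)
    )
    return moved / len(keys)

keys = [f"user:{i}" for i in range(10_000)]
fraction = moved_fraction(keys, old_shards=4, new_shards=5)
```

With a uniform hash, a key keeps its shard only when its hash gives the same remainder modulo 4 and modulo 5, so in expectation about 80% of keys move. This is the motivation for consistent hashing and directory-based schemes.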

Vertical Scaling

Definition: Add more resources (CPU, RAM) to single machine
Pros: Simpler (no distributed complexity)
Cons: Capped by the largest machine available, increasingly expensive at the high end, single point of failure
When to use: Before reaching limits, for databases requiring strong consistency

CAP Theorem

You can only have 2 of 3: Consistency, Availability, Partition tolerance

Consistency: All nodes see same data at same time
Availability: Every request to a non-failing node gets a non-error response (not necessarily the latest data)
Partition tolerance: System works despite network partitions

In practice: Network partitions will happen, so choose between:

CP (Consistency + Partition tolerance): Sacrifice availability during partition
- Examples: HBase, MongoDB (strong consistency mode)
- Use case: Financial systems, inventory
AP (Availability + Partition tolerance): Sacrifice consistency during partition
- Examples: Cassandra, DynamoDB, Riak
- Use case: Social media feeds, analytics

PACELC Theorem (more realistic):

If Partition, choose A or C
Else (no partition), choose Latency or Consistency

Query Optimization

Techniques

Indexes: Most critical (but each index adds write overhead)
Query analysis: Use EXPLAIN to see execution plan
Avoid SELECT *: Fetch only needed columns
Limit result sets: Pagination, WHERE clauses
Denormalization: For read-heavy workloads
Partitioning: Split large tables
Connection pooling: Reuse database connections
Caching: Redis for frequently accessed data

Common Issues

N+1 queries: Fetching related data in loop (use joins or batch fetches)
Full table scans: Missing indexes on WHERE/JOIN columns
Suboptimal joins: Wrong join order or type

CDC (Change Data Capture)

Purpose: Track changes in database (inserts, updates, deletes)
Methods:
- Log-based: Read database transaction logs (MySQL binlog, Postgres WAL)
- Trigger-based: Database triggers on changes
- Timestamp-based: Check last_modified column
Use cases:
- Data replication to data warehouse
- Invalidating caches
- Event-driven architectures
Tools: Debezium, Maxwell, AWS DMS

Full-Text Search

Problem: SQL LIKE '%keyword%' is slow (the leading wildcard prevents use of ordinary B-tree indexes, forcing a scan)
Solution: Specialized search engines with inverted indexes
Features:
- Tokenization (breaking text into terms)
- Stemming (run, running → run)
- Relevance scoring (TF-IDF, BM25)
- Fuzzy matching (typo tolerance)
Examples: Elasticsearch, Solr, Algolia
Architecture: Separate search cluster, sync from primary DB via CDC

Caching Strategies

Cache Levels

Client-side: Browser cache, mobile app cache
CDN: Edge servers cache static assets
Reverse proxy: Nginx caches responses
Application cache: In-memory (Redis, Memcached)
Database cache: Query result cache, buffer pool

Cache Patterns

Cache-Aside (Lazy Loading)

Check cache
If miss: fetch from DB, populate cache
Return data

Pros: Only caches requested data
Cons: Cache miss penalty, potential stale data
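
The three cache-aside steps can be sketched as below. A plain dict stands in for Redis/Memcached, and CacheAside, load_from_db, and fake_db are hypothetical names used only for illustration:

```python
class CacheAside:
    """Cache-aside sketch: the application, not the cache, talks to the database."""

    def __init__(self, load_from_db):
        self._cache = {}                  # stand-in for Redis/Memcached
        self._load_from_db = load_from_db
        self.db_reads = 0                 # counts how often the DB was hit

    def get(self, key):
        if key in self._cache:            # 1. check cache
            return self._cache[key]
        value = self._load_from_db(key)   # 2. on a miss, fetch from the DB...
        self.db_reads += 1
        self._cache[key] = value          #    ...and populate the cache
        return value                      # 3. return data

    def invalidate(self, key):
        """Drop a key after the underlying row changes, to limit stale reads."""
        self._cache.pop(key, None)

fake_db = {"user:1": {"name": "Ada"}}
cache = CacheAside(load_from_db=fake_db.get)
first = cache.get("user:1")    # miss: goes to the DB
second = cache.get("user:1")   # hit: served from the cache
```

Without invalidate (or a TTL), a later change to the database row would not be visible through the cache, which is the stale-data risk listed above.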

Read-Through

Cache sits between app and DB
Cache handles DB fetching automatically
Pros: Simpler app code
Cons: Cache miss still slow

Write-Through

Writes go to cache and DB synchronously
Pros: Cache always consistent
Cons: Higher write latency

Write-Behind (Write-Back)

Writes go to cache, asynchronously written to DB
Pros: Low write latency
Cons: Risk of data loss, complex

Write-Around

Writes go directly to DB, bypass cache
Pros: Avoids cache pollution from writes
Cons: Cache miss on next read

Cache Eviction Policies

LRU (Least Recently Used): Evict the item that was accessed longest ago (good general purpose)
LFU (Least Frequently Used): Evict the item with the fewest accesses (good for stable access patterns)
FIFO (First In First Out): Evict the item that was inserted first, regardless of access (simple but not optimal)
TTL (Time To Live): Evict after fixed time (good for time-sensitive data)
Random: Evict random item (simple, surprisingly effective)
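
A minimal LRU sketch built on collections.OrderedDict, which keeps keys in order so the least recently used item is always at the front. The class name and capacity are illustrative:

```python
from collections import OrderedDict

class LRUCache:
    """Fixed-capacity cache that evicts the least recently used key."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)          # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)   # evict the least recently used

    def keys(self):
        """Keys ordered from least to most recently used."""
        return list(self._items)

lru = LRUCache(capacity=2)
lru.put("a", 1)
lru.put("b", 2)
lru.get("a")        # "a" is now the most recently used
lru.put("c", 3)     # over capacity: "b" is evicted, not "a"
```

For caching function results, the standard library's functools.lru_cache provides the same policy without custom code.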

Cache Stampede (Thundering Herd)

Problem: Cache expires, multiple requests hit DB simultaneously
Solutions:
- Lock on cache miss (first request fetches, others wait)
- Probabilistic early expiration
- Background refresh before expiration

3. Scalability & Reliability

Load Balancing Algorithms

Round Robin

Distribute requests sequentially across servers
Pros: Simple, fair distribution
Cons: Doesn't consider server load/capacity
Weighted Round Robin: Assign more requests to powerful servers

Least Connections

Send to server with fewest active connections
Pros: Better for long-lived connections
Cons: Requires tracking connection state

Consistent Hashing

Hash both requests and servers onto ring
Request goes to next clockwise server
Pros: Minimal redistribution when adding/removing servers
Cons: Can create hotspots
Solution: Virtual nodes (multiple positions per server)
Use case: Distributed caches, sharding
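
A sketch of a hash ring with virtual nodes, assuming SHA-256 as the ring hash and 100 virtual nodes per server (both arbitrary choices). HashRing and the cache-a/b/c/d server names are hypothetical:

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    """Position on the ring: first 8 bytes of SHA-256 as an integer."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")

class HashRing:
    def __init__(self, nodes=(), vnodes: int = 100):
        self.vnodes = vnodes
        self._ring = []                  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node: str):
        # Each server appears at many positions to even out the load.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node: str):
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def get(self, key: str) -> str:
        if not self._ring:
            raise LookupError("ring is empty")
        # First ring position at or after the key's hash (clockwise walk).
        idx = bisect.bisect_left(self._ring, (_hash(key),))
        if idx == len(self._ring):       # wrap around past the last position
            idx = 0
        return self._ring[idx][1]

ring = HashRing(["cache-a", "cache-b", "cache-c"])
owner = ring.get("user:42")
```

Adding a fourth server only takes over the keys that now fall just before one of its virtual nodes; keys owned by the other servers do not shuffle among themselves.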

IP Hash

Hash client IP to determine server
Pros: Same client always goes to same server (session affinity)
Cons: Uneven distribution if IPs not diverse

Least Response Time

Send to server with fastest response
Pros: Adapts to server performance
Cons: Requires health checks, more complex

Rate Limiting & Throttling

Why Rate Limit?

Prevent abuse/DoS attacks
Fair resource allocation
Cost control (API quotas)
Ensure quality of service

Algorithms

Token Bucket

Bucket holds tokens (refilled at fixed rate)
Request consumes token
Pros: Handles bursts, smooth rate
Cons: More complex
Example: AWS API Gateway
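
A single-process token bucket sketch. Tokens are refilled lazily on each call rather than by a timer; the clock parameter is an assumption added so the behavior can be tested without sleeping:

```python
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst size
        self._clock = clock
        self._tokens = capacity       # start full so an initial burst is allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Consume `cost` tokens if available; otherwise reject the request."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Lazy refill: add tokens for the time elapsed, capped at capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False

# Usage: 5 requests/second sustained, bursts of up to 10.
bucket = TokenBucket(rate=5, capacity=10)
allowed = bucket.allow()
```

This version is not thread-safe and keeps state in one process; a shared limiter would need a lock or a central store.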

Leaky Bucket

Requests enter bucket, leak out at fixed rate
Pros: Smooth output rate
Cons: No burst handling
Use case: Network traffic shaping

Fixed Window

Allow N requests per time window (e.g., 100/hour)
Pros: Simple
Cons: Burst at window boundaries (up to 2N requests in a short span straddling two windows, e.g., 200 requests within about a second at a 100/hour limit)

Sliding Window Log

Track timestamp of each request
Count requests in sliding time window
Pros: Accurate, no boundary burst
Cons: Memory intensive (store all timestamps)

Sliding Window Counter

Combines fixed window + weighted previous window
Pros: Accurate, memory efficient
Cons: Slightly complex
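
One way to sketch the sliding window counter: keep counts for the current and previous fixed windows, and weight the previous count by how much of it still overlaps the sliding window. This is an approximation (it assumes requests were spread evenly over the previous window); the class name and injectable clock are illustrative:

```python
import time

class SlidingWindowCounter:
    def __init__(self, limit: int, window: float, clock=time.monotonic):
        self.limit = limit            # max requests per sliding window
        self.window = window          # window length in seconds
        self._clock = clock
        self._current_window = None   # index of the current fixed window
        self._current_count = 0
        self._previous_count = 0

    def allow(self) -> bool:
        now = self._clock()
        window_index = int(now // self.window)
        if window_index != self._current_window:
            # Roll over; keep the old count only if it is the adjacent window.
            if (self._current_window is not None
                    and window_index == self._current_window + 1):
                self._previous_count = self._current_count
            else:
                self._previous_count = 0
            self._current_window = window_index
            self._current_count = 0
        elapsed_fraction = (now % self.window) / self.window
        # Weighted estimate of requests in the last `window` seconds.
        estimated = self._previous_count * (1 - elapsed_fraction) + self._current_count
        if estimated + 1 <= self.limit:
            self._current_count += 1
            return True
        return False

limiter = SlidingWindowCounter(limit=100, window=3600)
allowed = limiter.allow()
```

Only two integers are stored per client, compared with one timestamp per request for the sliding window log.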

Implementation

Storage: Redis (INCR with EXPIRE)
Response: 429 Too Many Requests + Retry-After header
Distributed: Use centralized Redis, not in-memory (consistent across servers)

Message Queues & Streams

Use Cases

Decoupling: Producers/consumers don't need to know about each other
Async processing: Handle time-consuming tasks
Load leveling: Queue absorbs traffic spikes
Reliability: Messages persist until processed

Kafka

Model: Distributed log (append-only)
Key features:
- High throughput (millions msgs/sec)
- Partitions for parallelism
- Persistent storage
- Consumer groups
- Replay capability (seek to offset)
Use cases: Event streaming, log aggregation, real-time analytics

RabbitMQ

Model: Traditional message broker
Key features:
- Multiple exchange types (direct, topic, fanout)
- Acknowledgments
- Priority queues
- Dead letter queues
Use cases: Task queues, RPC, routing

AWS SQS

Model: Managed queue service
Types:
- Standard: At-least-once delivery, best-effort ordering
- FIFO: Exactly-once processing (duplicates removed within a 5-minute deduplication window), strict ordering within a message group
Features: Auto-scaling, dead letter queues, visibility timeout
Use cases: Decoupling microservices, job queues

Backpressure

Problem: Slow consumers can't keep up with producers
Solutions:
- Push-back to producers (reject requests)
- Dynamic batching
- Increase consumer parallelism
- Drop messages (if acceptable)

Consumer Groups

Concept: Multiple consumers in group share message processing
Kafka: Each partition is assigned to exactly one consumer in the group (max parallelism = partition count; extra consumers sit idle)
Benefits: Horizontal scaling, fault tolerance

Leader Election

Why?

Ensure single coordinator in distributed system
Prevent split-brain scenarios
Coordinate distributed operations

Algorithms

Raft

Leader elected via voting
Heartbeats maintain leadership
Log replication for state machine
Pros: Understandable, proven
Used in: etcd, Consul

Paxos

Consensus via proposers, acceptors, learners
Pros: Theoretically sound
Cons: Complex to implement
Used in: Google Chubby

ZooKeeper (ZAB protocol)

Centralized coordination service
Sequential consistency
Use cases: Configuration management, leader election, distributed locks
Drawback: Extra dependency that stops serving if it loses quorum (mitigated by running an odd-sized ensemble, e.g., 3 or 5 nodes)

Failover & Redundancy

Active-Passive (Master-Standby)

Setup: One active server, one standby
Failover: Standby takes over if active fails
Pros: Simpler, lower split-brain risk (fencing is still needed if failure detection can be wrong)
Cons: Wasted resources (standby idle)
Use case: Databases, critical services

Active-Active (Multi-Master)

Setup: All servers handle traffic
Pros: Better resource utilization, no failover delay
Cons: Complex conflict resolution, data sync
Use case: Stateless services, CDNs

Health Checks

Types:
- Passive: Monitor logs/metrics
- Active: Periodic pings/HTTP checks
Considerations: Check interval vs false positives, cascading failures

4. System Design Patterns

Read vs Write-Heavy Systems

Read-Heavy Optimization

Caching: Aggressive caching (Redis, CDN)
Read replicas: Multiple database replicas
Denormalization: Duplicate data to avoid joins
Indexing: Optimize for common queries
Examples: Social media feeds, news sites, e-commerce browsing

Write-Heavy Optimization

Write buffering: Queue writes, batch inserts
Asynchronous processing: Background workers
Eventual consistency: Accept temporary inconsistency
Sharding: Distribute writes across nodes
Optimize indexes: Fewer indexes (faster writes, slower reads)
Examples: IoT data ingestion, logging systems, analytics

CQRS (Command Query Responsibility Segregation)

Concept

Separate models for reads (queries) and writes (commands)
Different databases optimized for each

Write Side (Command)

Handles business logic
Validates and processes commands
Emits events

Read Side (Query)

Optimized for queries (denormalized views)
Updated via events from write side
Eventually consistent

Benefits

Independent scaling of reads/writes
Optimized data models for each
Clear separation of concerns

Drawbacks

Increased complexity
Eventual consistency challenges
Need to sync read models

Use Cases

Complex domains (e-commerce, banking)
Different read/write patterns
Multiple read models (different views of data)

Event Sourcing

Concept

Store events (state changes) instead of current state
Rebuild state by replaying events

Example

Events:
1. AccountCreated(id:123, balance:0)
2. MoneyDeposited(id:123, amount:100)
3. MoneyWithdrawn(id:123, amount:30)

Current state: balance = 70 (derived from events)
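
The account example can be replayed in code. A minimal sketch, assuming one event type per state change; the dataclass and function names mirror the events above but are otherwise illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AccountCreated:
    id: int
    balance: int

@dataclass(frozen=True)
class MoneyDeposited:
    id: int
    amount: int

@dataclass(frozen=True)
class MoneyWithdrawn:
    id: int
    amount: int

def apply(balance, event):
    """Pure function: previous state + one event -> next state."""
    if isinstance(event, AccountCreated):
        return event.balance
    if isinstance(event, MoneyDeposited):
        return balance + event.amount
    if isinstance(event, MoneyWithdrawn):
        return balance - event.amount
    raise TypeError(f"unknown event: {event!r}")

def replay(events, upto=None):
    """Rebuild the balance from the first `upto` events (all events if None)."""
    balance = None
    for event in events[:upto]:
        balance = apply(balance, event)
    return balance

events = [
    AccountCreated(id=123, balance=0),
    MoneyDeposited(id=123, amount=100),
    MoneyWithdrawn(id=123, amount=30),
]
current_balance = replay(events)
balance_after_deposit = replay(events, upto=2)   # "time travel" to an earlier point
```

Real systems usually add periodic snapshots so that rebuilding state does not require replaying the full history.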

Benefits

Complete audit trail
Time travel (replay to any point)
Debugging (see what happened)
Multiple projections from same events

Drawbacks

Complex queries (need to replay events)
Storage growth
Schema evolution challenges

Combined with CQRS

Events from write side → update read models
Natural fit: the event log serves as the write model, and each read model is a projection of it

Caching Patterns (Revisited)

Write-Through Caching

Flow: Write to cache → sync write to DB
Pros: Cache always up-to-date
Cons: Higher write latency (dual writes)
Use case: Read-heavy with some writes

Write-Back (Write-Behind) Caching

Flow: Write to cache → async write to DB (batched)
Pros: Fast writes, batch optimization
Cons: Data loss risk, complexity
Use case: High write throughput (logs, analytics)

Write-Around Caching

Flow: Write directly to DB, invalidate/bypass cache
Pros: Doesn't pollute cache with write data
Cons: Next read will be cache miss
Use case: Write-once, read-rarely data

Idempotency & Retries

Idempotency

Definition: Applying the same operation multiple times has the same effect as applying it once
Examples:
- Idempotent: DELETE /user/123 (same final state each time, even if later responses are 404)
- Not idempotent: POST /user (creates new user each time)

Why Important?

Network failures require retries
Distributed systems need duplicate handling
Prevent double-charging, duplicate records

Implementation

Idempotency keys: Client generates unique ID per request

POST /payment
Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000

Server stores: key → result mapping
On retry: Return cached result if key exists
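
The key-to-result mapping can be sketched as follows. A dict stands in for the server-side store, and PaymentService and its fields are hypothetical names, not a real payment API:

```python
class PaymentService:
    """Idempotency-key sketch: a retry with the same key returns the stored result."""

    def __init__(self):
        self._results = {}   # idempotency key -> stored result
        self.charges = []    # side effects actually performed

    def pay(self, idempotency_key: str, amount: int) -> dict:
        if idempotency_key in self._results:
            # Retry: return the cached result, do not charge again.
            return self._results[idempotency_key]
        self.charges.append(amount)                      # the real side effect
        result = {
            "status": "charged",
            "amount": amount,
            "charge_number": len(self.charges),
        }
        self._results[idempotency_key] = result
        return result

service = PaymentService()
key = "550e8400-e29b-41d4-a716-446655440000"
first = service.pay(key, amount=100)
retry_result = service.pay(key, amount=100)   # e.g., client timed out and retried
```

In a multi-server deployment the check-and-store step must be atomic (for example a unique constraint on the key), and stored keys are usually expired after some retention period.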

Retry Strategies

Exponential backoff: Wait 1s, 2s, 4s, 8s... (with jitter)
Circuit breaker: Stop retrying after threshold (prevent cascading failures)
Deadline propagation: Don't retry if deadline exceeded

Consistency Models

Strong Consistency (Linearizability)

Reads always return latest write
Pros: Simple reasoning, no surprises
Cons: Higher latency, lower availability
Examples: Traditional RDBMS, ZooKeeper
Use case: Financial transactions, inventory

Eventual Consistency

Reads may return stale data temporarily
Eventually all replicas converge
Pros: High availability, low latency
Cons: Complex application logic
Examples: DynamoDB, Cassandra, DNS
Use case: Social media, analytics

Causal Consistency

Causally-related operations seen in order
Concurrent operations can be seen in any order
Example:
- Post message → Like message (causal, must be ordered)
- Two likes from different users (concurrent, any order OK)
Use case: Collaborative applications

Read-Your-Writes Consistency

User always sees their own updates
Others may have delay
Implementation: Route a user's reads to the primary (or a replica known to have their latest write) for a short time after they write

Monotonic Reads

If user sees value, subsequent reads won't see older value
Prevents: Reading from lagging replica after reading from up-to-date one

5. Advanced Caching

CDN Deep Dive

Push CDN: Origin server pushes content to edge servers proactively
- Pros: Lower latency, predictable
- Cons: Wasted bandwidth for unpopular content
- Use case: Popular content known in advance
Pull CDN: Edge servers pull content on-demand (cache-aside)
- Pros: Only cache requested content
- Cons: First request slow (cold cache)
- Use case: Long-tail content distribution

Edge Computing

Run compute at edge locations (not just caching)
Use cases:
- A/B testing at edge
- Authentication/authorization
- Request manipulation
- Serverless functions
Examples: CloudFlare Workers, Lambda@Edge

Redis Advanced Patterns

Redis as Message Broker

Pub/Sub for real-time messaging
Streams for event sourcing
Pros: Fast, simple
Cons: No message persistence (pub/sub), less feature-rich than Kafka

Redis as Database

Persistence options: RDB snapshots, AOF logs
Use cases: Session store, leaderboards, rate limiting

Redis Data Structures

Strings, Lists, Sets, Sorted Sets, Hashes
HyperLogLog (cardinality estimation)
Bitmaps (user activity tracking)
Geospatial indexes

Application-Level Caching

In-Memory Caching

Libraries: Caffeine (Java), Go-cache, lru-cache (Node.js)
Pros: Ultra-fast (no network)
Cons: Not shared across servers, memory limited

Distributed Caching

Examples: Redis, Memcached
Pros: Shared state, larger capacity
Cons: Network latency, failure modes

Multi-Level Caching

Request → L1 (in-memory) → L2 (Redis) → L3 (Database)

Benefits: Balance speed and size
Invalidation: Coordinate across levels

Cache Invalidation Strategies

Time-based (TTL)

Simplest, works well for slowly-changing data
Risk: Serving stale data until expiration

Event-based

Invalidate on data changes
Methods: Pub/Sub, CDC, explicit invalidation
Pros: Always fresh data
Cons: Complexity, potential race conditions

Write-through/Write-behind

Update cache on writes
Pros: Cache always current
Cons: Write overhead

6. Observability

Backoff, Jitter, and Retry Strategies

Exponential Backoff

delay = base_delay * (2 ^ attempt)
Example: 1s, 2s, 4s, 8s, 16s...

Problem: Thundering herd (all clients retry simultaneously after backoff)

Jitter

delay = base_delay * (2 ^ attempt) * random(0.5, 1.5)

Benefit: Spreads out retries, prevents synchronized thundering herd
Types:
- Full jitter: Random between 0 and max delay
- Equal jitter: Half fixed, half random
- Decorrelated jitter: Next delay is random between the base delay and 3x the previous delay, capped at a maximum
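
A sketch of exponential backoff with the "full jitter" variant: each delay is drawn uniformly between 0 and the capped exponential value. The retry helper, its defaults, and the injectable sleep parameter are assumptions for illustration:

```python
import random
import time

def backoff_delays(base_delay: float, max_delay: float, attempts: int, rng=random):
    """Yield one full-jitter delay per attempt: uniform in [0, min(max, base * 2**n)]."""
    for attempt in range(attempts):
        ceiling = min(max_delay, base_delay * (2 ** attempt))
        yield rng.uniform(0, ceiling)

def retry(operation, attempts: int = 5, base_delay: float = 1.0,
          max_delay: float = 30.0, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts are used up."""
    delays = backoff_delays(base_delay, max_delay, attempts - 1)
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise                    # out of attempts: surface the last error
            sleep(next(delays))          # wait a randomized, growing delay
```

In practice the except clause should be narrowed to errors that are safe to retry (timeouts, 503s), and the operation itself should be idempotent.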

Circuit Breaker Pattern

States:

Closed: Normal operation, requests pass through
Open: Errors exceed threshold, block requests immediately (fail fast)
Half-Open: After timeout, allow test request
- Success → Close circuit
- Failure → Reopen circuit

Benefits:

Prevent cascading failures
Give downstream service time to recover
Fast failure instead of waiting for timeout
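
The three states can be sketched as a small state machine. This counts consecutive failures (real implementations often use a rolling error rate instead); CircuitBreaker, CircuitOpenError, and the thresholds are illustrative names and values:

```python
import time

class CircuitOpenError(Exception):
    """Raised when the circuit is open and the call is rejected immediately."""

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self.state = "closed"
        self._failures = 0          # consecutive failures while closed
        self._opened_at = None

    def call(self, operation):
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; failing fast")
            self.state = "half_open"     # timeout elapsed: allow one test request
        try:
            result = operation()
        except Exception:
            self._record_failure()
            raise
        self._failures = 0               # success closes the circuit
        self.state = "closed"
        return result

    def _record_failure(self):
        self._failures += 1
        if self.state == "half_open" or self._failures >= self.failure_threshold:
            self.state = "open"          # trip (or re-trip) the breaker
            self._opened_at = self._clock()
```

Callers typically catch CircuitOpenError and return a fallback (cached data, a default, or a clear error) instead of waiting on a dependency that is known to be down.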

Logging Best Practices

Structured Logging

{
  "timestamp": "2025-10-02T10:30:00Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "abc-123",
  "user_id": "456",
  "message": "Payment processing failed",
  "error": "insufficient funds"
}

Benefits: Machine-parseable, searchable, aggregatable
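
One way to emit entries like the one above with Python's standard logging module. A sketch only: JsonFormatter, build_logger, and the list of context fields are assumptions, not a standard API:

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    # Hypothetical context fields, mirroring the example entry.
    CONTEXT_FIELDS = ("service", "trace_id", "user_id", "error")

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):      # set via logger.error(..., extra={...})
                entry[field] = getattr(record, field)
        return json.dumps(entry)

def build_logger(name: str = "payment-service") -> logging.Logger:
    """Attach a stream handler that writes JSON lines."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Usage: build_logger().error("Payment processing failed", extra={"trace_id": "abc-123", "user_id": "456", "error": "insufficient funds"}) writes a single JSON line that log aggregators can index by field.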

Log Levels

ERROR: Failures requiring immediate attention
WARN: Unexpected but handled (retry success, degraded mode)
INFO: Important business events (user signup, payment)
DEBUG: Detailed diagnostic info (development only)

Correlation IDs

Unique ID per request, propagated across services
Benefit: Trace request flow through distributed system
Implementation: X-Request-ID header

What NOT to Log

Passwords, API keys, PII (privacy/security)
Excessive debugging in production (cost/noise)
Every database query (performance)

Monitoring & Metrics

Key Metrics (RED Method)

Rate: Requests per second
Errors: Error rate/count
Duration: Latency (p50, p95, p99)

USE Method (for Resources)

Utilization: % time resource busy (CPU, memory)
Saturation: Queue length (requests waiting)
Errors: Error count

Golden Signals (Google SRE)

Latency: Time to serve request
Traffic: Request volume
Errors: Failed request rate
Saturation: System fullness (how close to capacity)

Metric Types

Counter: Monotonically increasing (requests_total)
Gauge: Current value (cpu_usage, active_connections)
Histogram: Distribution of values (request_duration_seconds)
Summary: Like histogram but calculates quantiles on the client side (so they cannot be aggregated across instances)

Prometheus & Grafana

Prometheus: Time-series database, pull-based scraping
- PromQL query language
- Alert manager integration
Grafana: Visualization dashboards
- Multiple data sources
- Alerting, annotations

Alerting Best Practices

Alert Fatigue Prevention

Alert on symptoms, not causes (user-facing issues, not low-level)
Make alerts actionable (clear remediation steps)
Avoid duplicate alerts
Use severity levels (critical vs warning)

On-Call Considerations

Runbooks: Step-by-step troubleshooting guides
Escalation policies: Who to notify, when
Postmortems: Blameless analysis after incidents

Distributed Tracing

Concept

Track request flow across microservices
Visualize latency bottlenecks
Identify failing service in chain

Implementation

Trace: Single request journey
Span: Individual operation (DB query, HTTP call)
Trace ID: Unique identifier for entire trace
Span ID: Unique identifier for operation

Tools

Jaeger: Open-source, CNCF project
Zipkin: Twitter-originated
OpenTelemetry: Vendor-neutral standard (merges OpenTracing + OpenCensus)

Example Trace

Frontend (50ms)
  ├─ Auth Service (10ms)
  ├─ Product Service (30ms)
  │   └─ Database Query (25ms)  ← Bottleneck!
  └─ Payment Service (5ms)

SLO, SLI, SLA

SLI (Service Level Indicator)

Definition: Metric representing system health
Examples:
- Request success rate
- Request latency (p95 < 200ms)
- System uptime

SLO (Service Level Objective)

Definition: Target value/range for SLI
Example: "99.9% of requests succeed"
Purpose: Internal goal for reliability

SLA (Service Level Agreement)

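Definition: Contract with users (with consequences if the agreed service level is missed)
Example: "99.9% uptime or customer gets refund"
Relationship: SLA target ≤ SLO target (keep the internal SLO stricter than the contractual SLA, so there is a buffer before penalties apply)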

Error Budgets

Concept: Acceptable amount of unreliability implied by the SLO (1 - SLO)
Example: 99.9% SLO = about 43 minutes of downtime allowed per 30-day month (0.1% of 43,200 minutes)
Usage: If budget exhausted, freeze features and focus on reliability
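
The error budget arithmetic, as a small helper. A sketch assuming an availability SLO and a 30-day month; the function names are illustrative:

```python
def error_budget_minutes(slo: float, period_days: float = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the period."""
    total_minutes = period_days * 24 * 60
    return (1 - slo) * total_minutes

def budget_remaining(slo: float, downtime_minutes: float, period_days: float = 30) -> float:
    """Minutes of budget left; a negative value means the budget is exhausted."""
    return error_budget_minutes(slo, period_days) - downtime_minutes

monthly_budget = error_budget_minutes(0.999)               # 0.1% of 43,200 minutes
after_incident = budget_remaining(0.999, downtime_minutes=50)
```

Each extra nine shrinks the budget by a factor of ten, which is why the step from 99.9% to 99.99% is usually far more expensive than it looks.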

7. Security & Privacy

Authentication vs Authorization

Authentication

Definition: Verifying who the user is
Methods:
- Username/password
- Multi-factor authentication (MFA)
- Biometrics
- Certificate-based

Authorization

Definition: What the user can do
Models:
- RBAC (Role-Based): User has roles (admin, editor), roles have permissions
- ABAC (Attribute-Based): Policies based on attributes (department, clearance level)
- ACL (Access Control List): Resource lists who can access

Authentication Mechanisms

JWT (JSON Web Tokens)

Structure: Header.Payload.Signature

{
  "sub": "user123",
  "name": "John Doe",
  "exp": 1730000000
}

Pros: Stateless, self-contained, scalable
Cons: Hard to revoke before expiry (needs short lifetimes or a server-side denylist), token size
Use case: API authentication, microservices
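
How Header.Payload.Signature fits together, sketched with only the standard library. This assumes HS256 (HMAC-SHA256 with a shared secret); sign_jwt and verify_jwt are illustrative helpers, and production code should use a vetted JWT library instead:

```python
import base64
import hashlib
import hmac
import json

def _b64url(data: bytes) -> str:
    """Base64url without padding, as used by JWT."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")

def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))

def sign_jwt(payload: dict, secret: bytes) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"

def verify_jwt(token: str, secret: bytes, now: float) -> dict:
    """Return the payload if the signature matches and the token has not expired."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("malformed token")
    header_b64, payload_b64, signature_b64 = parts
    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    payload = json.loads(_b64url_decode(payload_b64))
    if payload.get("exp", float("inf")) <= now:
        raise ValueError("token expired")
    return payload

token = sign_jwt({"sub": "user123", "name": "John Doe", "exp": 1730000000}, b"demo-secret")
```

The payload is only encoded, not encrypted, so anyone holding the token can read it. This sketch also skips checks a real verifier needs, such as validating the alg header and the audience.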

OAuth 2.0

Purpose: Delegated authorization (allow app access without sharing password)
Flow Example (Authorization Code):
1. User clicks "Login with Google"
2. Redirect to Google (authorization server)
3. User approves
4. Google redirects back with authorization code
5. App exchanges code for access token
6. App uses token to access Google APIs
Roles:
- Resource Owner (user)
- Client (your app)
- Authorization Server (Google)
- Resource Server (Google APIs)

SSO (Single Sign-On)

Definition: One login for multiple applications
Protocols: SAML, OpenID Connect (identity layer built on OAuth 2.0; OAuth 2.0 alone covers authorization, not authentication)
Benefits: Better UX, centralized access control
Use case: Enterprise applications

Session-Based Authentication

Flow:
1. User logs in
2. Server creates session, stores in DB/Redis
3. Returns session ID cookie
4. Client sends cookie with requests
Pros: Can revoke immediately
Cons: Stateful, harder to scale (requires sticky sessions or shared session store)
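
The flow above can be sketched with an in-memory store, assuming a single server process. The `SessionStore` class is a hypothetical stand-in for the DB/Redis store in step 2; a real deployment would share it across servers.

```python
import secrets
import time
from typing import Dict, Optional, Tuple


class SessionStore:
    """In-memory stand-in for a shared session store (illustration only)."""

    def __init__(self, ttl_seconds: float = 1800.0) -> None:
        self._ttl = ttl_seconds
        self._sessions: Dict[str, Tuple[str, float]] = {}  # id -> (user, expires_at)

    def create(self, user_id: str) -> str:
        session_id = secrets.token_urlsafe(32)  # unguessable cookie value
        self._sessions[session_id] = (user_id, time.monotonic() + self._ttl)
        return session_id

    def lookup(self, session_id: str) -> Optional[str]:
        entry = self._sessions.get(session_id)
        if entry is None:
            return None
        user_id, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._sessions[session_id]  # lazily drop expired sessions
            return None
        return user_id

    def revoke(self, session_id: str) -> None:
        # Immediate revocation: the next request with this cookie is rejected.
        self._sessions.pop(session_id, None)


store = SessionStore()
sid = store.create("user123")
print(store.lookup(sid))
store.revoke(sid)
print(store.lookup(sid))
```

Every request costs one store lookup, which is the price paid for being able to revoke a session instantly.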

Encryption

TLS/HTTPS

Purpose: Encrypt data in transit
Handshake:
1. Client Hello (supported cipher suites)
2. Server Hello (chosen cipher, certificate)
3. Client verifies certificate
4. Key exchange (establish shared secret)
5. Encrypted communication
TLS 1.3: Faster handshake, stronger security

Data at Rest

Encryption: AES-256 (symmetric encryption)
Key management: HSM (Hardware Security Module), KMS (Key Management Service)
Database encryption:
- Full disk encryption
- Column-level encryption (for sensitive fields)
Application-level encryption: Encrypt before storing

Hashing vs Encryption

Hashing: One-way (passwords)
- Use bcrypt, Argon2 (not MD5, SHA1)
- Salt to prevent rainbow tables
Encryption: Two-way (reversible with key)
- Use AES, RSA
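
A sketch of salted, one-way password hashing. bcrypt and Argon2 are third-party packages in Python, so this example swaps in PBKDF2-HMAC-SHA256 from the standard library as a stand-in; the iteration count is an assumed work factor, not a recommendation from the source.

```python
import hashlib
import hmac
import os
from typing import Tuple

ITERATIONS = 600_000  # assumed work factor; tune for your hardware


def hash_password(password: str, iterations: int = ITERATIONS) -> Tuple[bytes, bytes]:
    """Return (salt, digest). A fresh random salt defeats rainbow tables."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, digest


def verify_password(
    password: str, salt: bytes, expected: bytes, iterations: int = ITERATIONS
) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return hmac.compare_digest(digest, expected)


salt, stored = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, stored))
print(verify_password("wrong guess", salt, stored))
```

Nothing here can turn `stored` back into the password; verification works only by hashing the candidate again, which is the practical difference from encryption.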

DDoS Protection & WAF

DDoS (Distributed Denial of Service)

Types:

Volumetric: Flood with traffic (UDP flood, amplification attacks)
Protocol: Exploit protocol weaknesses (SYN flood)
Application Layer: Target application (HTTP flood)

Mitigation:

Rate limiting: Per IP, per endpoint
CDN: Absorb traffic at edge
Anycast: Distribute traffic across locations
Traffic analysis: Identify and block malicious patterns
Overprovisioning: Have excess capacity

WAF (Web Application Firewall)

Purpose: Filter malicious HTTP traffic
Protection against:
- SQL injection
- XSS (Cross-Site Scripting)
- CSRF (Cross-Site Request Forgery)
- Path traversal
Types:
- Network-based (hardware appliance)
- Host-based (integrated in app)
- Cloud-based (CloudFlare, AWS WAF)
Rules: Signature-based, behavioral analysis

Data Privacy (GDPR)

Key Principles

Lawful basis: Need a legal ground for processing (e.g., consent, contract, legitimate interest)
Data minimization: Collect only necessary data
Purpose limitation: Use data only for stated purpose
Storage limitation: Don't keep data longer than needed
Accuracy: Keep data up-to-date
Security: Protect with encryption, access controls

User Rights

Right to access: User can request their data
Right to erasure: "Right to be forgotten" (delete data)
Right to portability: Export data in machine-readable format
Right to rectification: Correct inaccurate data

Implementation Considerations

Data inventory: Know what PII you collect
Consent management: Track and honor user consent
Data retention policies: Auto-delete old data
Breach notification: Report breaches within 72 hours
Privacy by design: Build privacy into systems from start

8. Infrastructure & Deployment

Containers & Orchestration

Docker

Key Concepts:

Image: Read-only template (base OS + app + dependencies)
Container: Running instance of image
Dockerfile: Instructions to build image
Layers: Each instruction creates layer (cached for efficiency)

Benefits:

Consistent environments (dev = prod)
Lightweight (vs VMs)
Fast startup
Isolated processes

Kubernetes (K8s)

Architecture:

Control Plane: Master node(s)
- API Server
- Scheduler (assigns pods to nodes)
- Controller Manager (maintains desired state)
- etcd (distributed config store)
Worker Nodes: Run pods
- kubelet (agent)
- kube-proxy (networking)
- Container runtime (containerd, CRI-O)

Key Resources:

Pod: Smallest unit (1+ containers)
Deployment: Manages replica sets, rolling updates
Service: Stable endpoint for pods (load balancing)
ConfigMap: Configuration data
Secret: Sensitive data (only base64-encoded by default; enable encryption at rest for real protection)
Ingress: HTTP(S) routing to services
Namespace: Virtual clusters for isolation

Benefits:

Auto-scaling (HPA - Horizontal Pod Autoscaler)
Self-healing (restart failed pods)
Rolling updates, rollbacks
Service discovery
Storage orchestration

CI/CD Pipelines

Continuous Integration (CI)

Goal: Frequently merge code to main branch
Pipeline:
1. Code commit triggers build
2. Run tests (unit, integration)
3. Static analysis (linting, security scans)
4. Build artifacts (Docker images)
Benefits: Catch bugs early, reduce merge conflicts

Continuous Deployment (CD)

Goal: Automatically deploy to production
Pipeline:
1. Successful CI build
2. Deploy to staging
3. Run E2E tests
4. Deploy to production (if tests pass)

Deployment Strategies

Blue-Green Deployment

Setup: Two identical environments (Blue = current, Green = new)
Process:
1. Deploy new version to Green
2. Test Green
3. Switch traffic from Blue to Green
4. Keep Blue for quick rollback
Pros: Zero downtime, instant rollback
Cons: Double resources

Canary Deployment

Process:
1. Deploy new version to small % of servers (5%)
2. Monitor errors, performance
3. Gradually increase % (10%, 25%, 50%, 100%)
4. Rollback if issues
Pros: Lower risk, real-world testing
Cons: Slower rollout, complex routing
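
The canary steps can be sketched as a small control loop, assuming some monitoring function reports the error rate at a given traffic percentage. `canary_rollout`, `error_rate_at`, and the 1% threshold are hypothetical names and values for illustration.

```python
from typing import Callable, Sequence


def canary_rollout(
    stages: Sequence[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float = 0.01,
) -> int:
    """Return the final traffic percentage; 0 means the release was rolled back."""
    current = 0
    for percent in stages:
        current = percent  # shift this share of traffic to the new version
        if error_rate_at(current) > max_error_rate:  # monitor before going further
            return 0  # roll back on the first unhealthy stage
    return current


healthy = canary_rollout([5, 10, 25, 50, 100], lambda percent: 0.001)
broken = canary_rollout([5, 10, 25, 50, 100], lambda p: 0.05 if p >= 25 else 0.001)
print(healthy, broken)
```

In practice each stage would also wait for a soak period before sampling metrics; that delay is the "slower rollout" cost listed above.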

Rolling Deployment

Process: Update servers one-by-one (or in batches)
Pros: No extra resources
Cons: Mixed versions during rollout, slower rollback

Feature Flags

Deploy code with features disabled
Enable features gradually (per user, %)
Pros: Decouple deployment from release, A/B testing
Cons: Code complexity, technical debt
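
One common way to enable a feature for a percentage of users is deterministic bucketing: hash the flag name and user ID into a bucket from 0 to 99 and compare it with the rollout percentage. This is a sketch under that assumption; `bucket` and `is_enabled` are hypothetical helpers, not the API of any particular feature-flag product.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    # The bucket never changes, so a user who has the feature at 10%
    # still has it at 50%.
    return bucket(flag, user_id) < rollout_percent


users = [f"user{i}" for i in range(1000)]
enabled = sum(is_enabled("new-checkout", u, 25) for u in users)
print(f"{enabled} of {len(users)} users see the feature at 25%")
```

Including the flag name in the hash input gives each flag its own independent split, so the same users are not always the first to get every experiment.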

Service Discovery

Problem

Microservices need to find each other
IPs/ports change dynamically (scaling, failures)

Client-Side Discovery

Process: Client queries service registry, chooses instance, makes request
Examples: Netflix Eureka
Pros: Client controls load balancing
Cons: Client complexity, tight coupling to registry
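
A sketch of client-side discovery with round-robin selection, assuming an in-memory registry. `ServiceRegistry` and `DiscoveryClient` are hypothetical classes that stand in for a real registry such as Eureka and its client library; the addresses are made up.

```python
import itertools
from typing import Dict, List


class ServiceRegistry:
    """Hypothetical in-memory registry (illustration only)."""

    def __init__(self) -> None:
        self._instances: Dict[str, List[str]] = {}

    def register(self, service: str, address: str) -> None:
        self._instances.setdefault(service, []).append(address)

    def deregister(self, service: str, address: str) -> None:
        addresses = self._instances.get(service, [])
        if address in addresses:
            addresses.remove(address)

    def instances(self, service: str) -> List[str]:
        return list(self._instances.get(service, []))


class DiscoveryClient:
    """The caller queries the registry and picks the instance itself."""

    def __init__(self, registry: ServiceRegistry) -> None:
        self._registry = registry
        self._counter = itertools.count()

    def choose(self, service: str) -> str:
        candidates = self._registry.instances(service)
        if not candidates:
            raise LookupError(f"no registered instances for {service}")
        return candidates[next(self._counter) % len(candidates)]


registry = ServiceRegistry()
registry.register("orders", "10.0.0.1:8080")
registry.register("orders", "10.0.0.2:8080")
client = DiscoveryClient(registry)
print([client.choose("orders") for _ in range(3)])
```

The load-balancing policy lives in `choose`, which is both the advantage (the client controls it) and the drawback (every client language needs its own implementation).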

Server-Side Discovery

Process: Client requests load balancer, load balancer queries registry
Examples: AWS ELB, Kubernetes Service
Pros: Client simplicity
Cons: Load balancer is potential bottleneck/SPOF

Service Registry

Examples: Consul, etcd, ZooKeeper
Features:
- Health checks
- Automatic deregistration of failed services
- DNS interface

Kubernetes Service Discovery

Built-in via DNS and environment variables
ClusterIP service provides stable virtual IP
DNS: service-name.namespace.svc.cluster.local

API Gateway

Purpose

Single entry point for clients
Abstracts backend complexity

Responsibilities

Routing: Direct requests to appropriate microservice
Authentication/Authorization: Centralized security
Rate limiting: Protect backend services
Request/Response transformation: Adapt protocols/formats
Caching: Reduce backend load
Logging & Monitoring: Centralized observability
SSL termination: Handle TLS at gateway

Patterns

Backend for Frontend (BFF): Separate gateway per client type (web, mobile, IoT)

Examples

Kong, AWS API Gateway, Apigee, Zuul

Microservices vs Monoliths

Monolith

Pros:

Simple to develop, test, deploy (initially)
No network latency between components
Easier transactions (single DB)
Simpler debugging

Cons:

Scales as one unit (can't scale component independently)
Tech stack lock-in
Deployment risk (entire app redeployed)
Codebase becomes unwieldy

Microservices

Pros:

Independent scaling per service
Technology diversity
Fault isolation (one service failure doesn't crash all)
Faster deployments (small, independent)
Team autonomy

Cons:

Distributed system complexity (network failures, latency)
Data consistency challenges
Increased operational overhead (monitoring, deployment)
Testing complexity

When to Use

Monolith: Startups, simple domains, small teams
Microservices: Large orgs, complex domains, independent team scaling

Migration Strategy

Start with monolith
Extract services as domain understanding grows
"Strangler Fig" pattern (gradually replace monolith pieces)

9. Special Topics

Search Systems

Inverted Index (Deep Dive)

Structure:

Term → [Doc1, Doc2, ...]

"hello" → [doc1, doc3, doc5]
"world" → [doc1, doc2]

With Positions (for phrase queries):

"hello" → {doc1: [0, 15], doc3: [5]}

Search Process:

Tokenize query: "hello world" → ["hello", "world"]
Lookup each term in index
Intersect posting lists: doc1 (appears in both)
Rank results by relevance
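
The search process above (minus ranking) can be sketched as follows, assuming whitespace tokenization and lowercase normalization. The document texts are invented so that the posting lists match the "hello" and "world" example.

```python
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def build_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each term to the set of document IDs that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return documents containing every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = sorted((index.get(term, set()) for term in terms), key=len)
    result = set(postings[0])  # start from the shortest posting list
    for posting in postings[1:]:
        result &= posting
    return result


docs = {
    "doc1": "hello world",
    "doc2": "world news",
    "doc3": "hello there",
    "doc5": "hello again",
}
index = build_index(docs)
print(sorted(index["hello"]), sorted(index["world"]))
print(search(index, "hello world"))
```

Intersecting from the shortest posting list first keeps the intermediate result small, which matters when one term is very common.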

Ranking Algorithms

TF-IDF (Term Frequency-Inverse Document Frequency)

TF: How often term appears in document
IDF: How rare term is across all documents
Score: TF × IDF (terms that are frequent in a document but rare across the corpus rank high)
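
A worked sketch of one common TF-IDF variant, assuming TF is the term count divided by document length and IDF is the natural log of (number of documents / documents containing the term). Real engines use smoothed or otherwise adjusted formulas; the tiny corpus is invented.

```python
import math
from collections import Counter
from typing import List


def tf_idf(term: str, doc: List[str], corpus: List[List[str]]) -> float:
    """Score one term for one document against the whole corpus."""
    if not doc:
        return 0.0
    docs_with_term = sum(1 for d in corpus if term in d)
    if docs_with_term == 0:
        return 0.0
    tf = Counter(doc)[term] / len(doc)             # how often the term appears here
    idf = math.log(len(corpus) / docs_with_term)   # how rare it is overall
    return tf * idf


corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran", "far"],
]
for term in ("the", "cat", "sat"):
    print(term, round(tf_idf(term, corpus[0], corpus), 3))
```

"the" appears in every document, so its IDF is log(1) = 0 and it contributes nothing, while "sat" appears in only one document and scores highest.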

BM25 (Best Match 25)

Improved TF-IDF with diminishing returns
Considers document length normalization
Industry standard

Elasticsearch Architecture

Cluster: Multiple nodes
Index: Collection of documents (like database)
Shard: Subset of index data (for horizontal scaling)
Replica: Copy of shard (for availability)

Query Types:

Match query (full-text search)
Term query (exact match)
Range query (dates, numbers)
Bool query (AND, OR, NOT)
Fuzzy query (typo tolerance)

Bloom Filters

Problem

Check if element exists in set
Traditional: Hash table (space inefficient for large sets)

Bloom Filter

Data structure: Bit array + k hash functions
Add: Set bits at k hash positions to 1
Check: If all k bits are 1, element might exist
False positives: Possible (bits set by other elements)
False negatives: Impossible (adding an element sets all k of its bits, so if any bit is 0 the element was definitely never added)

Use Cases

Database: Check if key exists before expensive disk lookup
Web: Block malicious URLs (quick check before full validation)
Distributed systems: Reduce unnecessary network calls
Example: Google Chrome has used Bloom filters for malicious site detection

Trade-off

Space efficient (small bit array)
Tunable false positive rate (more bits/hashes = fewer false positives)
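
A compact Bloom filter sketch, assuming string items and deriving the k hash functions by salting SHA-256 with the hash index. The default sizes are arbitrary illustration values, not tuned parameters.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # bit array packed into bytes

    def _positions(self, item: str) -> Iterator[int]:
        # One salted SHA-256 per hash function yields k bit positions.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


seen = BloomFilter()
for url in ("evil.example", "phish.example"):
    seen.add(url)
print(seen.might_contain("evil.example"))
```

A `False` answer can skip the expensive lookup entirely; a `True` answer still needs the authoritative check because it may be a false positive.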

Recommendation Systems

Collaborative Filtering

User-based:

Find similar users (based on past behavior)
Recommend items those users liked
Example: Users who liked A and B also liked C

Item-based:

Find similar items (based on user interactions)
Recommend similar items to what user liked
Example: People who liked this movie also liked...

Matrix Factorization (popularized by the Netflix Prize):

Decompose user-item matrix into latent factors
Predict missing ratings

Content-Based Filtering

Recommend based on item attributes
Example: User likes sci-fi movies → recommend other sci-fi

Hybrid Approaches

Combine collaborative + content-based
Cold start problem: Use content-based for new users/items

Ranking

Factors: Relevance, popularity, diversity, freshness
ML models: Gradient boosting, neural networks
A/B testing: Compare ranking algorithms

Distributed Transactions

Problem

Transaction spans multiple databases/services
Need ACID guarantees across systems

Two-Phase Commit (2PC)

Phase 1 (Prepare):

Coordinator asks all participants: "Can you commit?"
Participants lock resources, respond yes/no

Phase 2 (Commit/Abort):

If all said yes: Coordinator sends "commit" to all
If any said no: Coordinator sends "abort" to all

Problems:

Blocking: If coordinator crashes, participants locked
Single point of failure: Coordinator
Performance: Synchronous, slow
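
The two phases can be sketched in-process, assuming participants never crash and calls never time out (exactly the failure cases that make real 2PC hard). `Participant` and `two_phase_commit` are hypothetical names for illustration.

```python
from typing import List


class Participant:
    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self) -> bool:
        # Phase 1: lock resources and vote yes/no.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: List[Participant]) -> bool:
    """Coordinator logic: commit only if every participant votes yes."""
    votes = [p.prepare() for p in participants]  # Phase 1: collect all votes
    if all(votes):
        for p in participants:                   # Phase 2: commit everywhere
            p.commit()
        return True
    for p in participants:                       # Phase 2: abort everywhere
        p.abort()
    return False


ok = two_phase_commit([Participant("orders"), Participant("payments")])
failed = two_phase_commit([Participant("orders"), Participant("payments", can_commit=False)])
print(ok, failed)
```

Between `prepare` and the phase 2 message a participant holds its locks and cannot decide alone, which is where the blocking problem comes from if the coordinator disappears.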

Saga Pattern

Concept: Break transaction into local transactions, compensate on failure

Example (booking trip):

1. Book flight (local tx)
2. Book hotel (local tx)
3. Book car (local tx)

If step 3 fails → compensate: cancel hotel, cancel flight

Types:

Choreography: Services communicate via events (decentralized)
Orchestration: Central coordinator (like 2PC but async)

Pros: No blocking, better availability
Cons: Eventual consistency, complex compensation logic
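
An orchestration-style sketch of the trip-booking saga, assuming each step is a local transaction paired with a compensating action and that compensations themselves succeed. `run_saga` and `book_trip` are hypothetical helpers; the "bookings" only append to a log.

```python
from typing import Callable, List, Tuple

# (name, action, compensation)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    completed: List[Step] = []
    for step in steps:
        _, action, _ = step
        try:
            action()
        except Exception:
            for _, _, compensate in reversed(completed):
                compensate()
            return False
        completed.append(step)
    return True


def book_trip(car_available: bool) -> List[str]:
    log: List[str] = []

    def book_car() -> None:
        if not car_available:
            raise RuntimeError("no cars left")
        log.append("book car")

    steps: List[Step] = [
        ("flight", lambda: log.append("book flight"), lambda: log.append("cancel flight")),
        ("hotel", lambda: log.append("book hotel"), lambda: log.append("cancel hotel")),
        ("car", book_car, lambda: log.append("cancel car")),
    ]
    run_saga(steps)
    return log


print(book_trip(car_available=False))
```

Between the failed step and the last compensation, other readers can observe the flight and hotel as booked; that window is the eventual-consistency cost of a saga.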

When to Use

2PC: When strong consistency absolutely required (rare)
Saga: Most distributed systems (accept eventual consistency)
Avoid distributed transactions: Design to avoid need (bounded contexts)

Consensus Algorithms

Why Needed?

Distributed systems need to agree on values
Leader election, configuration, distributed locks

Raft (Understandable Consensus)

Roles:

Leader: Handles all client requests
Follower: Passive, replicate leader's log
Candidate: Follower becomes candidate during election

Leader Election:

Leader sends heartbeats
If follower doesn't hear heartbeat (timeout) → becomes candidate
Candidate requests votes from other nodes
Majority votes → becomes leader

Log Replication:

Leader receives command, appends to log
Sends log entry to followers
When majority replicate → entry committed
Leader notifies followers, applies to state machine
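
The "majority replicated" rule can be sketched as a calculation over how far each server's log matches the leader's. This is a simplification: real Raft also requires the entry to belong to the leader's current term before it is committed by counting replicas. `commit_index` and the sample values are illustrative assumptions.

```python
from typing import List


def commit_index(match_index: List[int]) -> int:
    """Highest log index stored on a majority of servers (leader included)."""
    ordered = sorted(match_index, reverse=True)
    majority = len(match_index) // 2 + 1
    # The majority-th largest value is present on at least `majority` servers.
    return ordered[majority - 1]


# Five servers: the leader is at index 5, followers lag by varying amounts.
print(commit_index([5, 5, 3, 2, 1]))
```

With five servers, index 3 is held by three of them, so entries up to 3 are safe even if any two servers fail.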

Guarantees:

Only one leader per term
Logs eventually identical across servers
Committed entries durable

Paxos

More complex than Raft (harder to understand/implement)
Three roles: Proposers, Acceptors, Learners
Multi-Paxos optimized for multiple decisions
Used in: Google Chubby, Spanner

Practical Usage

Don't implement yourself: Use existing (etcd, Consul, ZooKeeper)
Use for: Leader election, distributed config, locking
Not for: Every coordination need (too heavyweight)

Time & Ordering in Distributed Systems

Problem

No global clock in distributed systems
Clock skew (servers have different times)
Need to order events across servers

Lamport Clocks (Logical Time)

Each process maintains counter
Rules:
1. Increment counter before each event
2. Send counter with message
3. Receiver sets counter = max(local, received) + 1
Property: If event A happened-before B, then timestamp(A) < timestamp(B)
Limitation: Converse not true (can't determine causality from timestamps alone)
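
The three rules map directly onto a small class. This is a sketch that models message passing as plain function calls; `LamportClock` is a name chosen for the example.

```python
class LamportClock:
    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Rule 1: increment the counter before each local event."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Rule 2: sending is an event; attach the new timestamp to the message."""
        return self.tick()

    def receive(self, received: int) -> int:
        """Rule 3: move past both the local and the received timestamp."""
        self.time = max(self.time, received) + 1
        return self.time


a, b = LamportClock(), LamportClock()
a.tick()                     # local event on A
sent_at = a.send()           # A sends a message
received_at = b.receive(sent_at)
print(sent_at, received_at)
```

The receive timestamp is always larger than the send timestamp, which is the happened-before guarantee; two unrelated events may still get ordered timestamps by coincidence, which is the limitation noted above.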

Vector Clocks

Each process maintains vector of counters (one per process)
Example: [P1:3, P2:5, P3:2]
Rules:
1. Increment own counter on event
2. Send entire vector with message
3. Receiver merges vectors (max of each component), then increments own counter
Property: Can determine causality
- A happened-before B: VA ≤ VB in every component and VA ≠ VB
- Concurrent events: Neither VA < VB nor VB < VA
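
A sketch of vector clocks as dictionaries keyed by process name, treating a missing entry as 0. The comparison uses "less than or equal in every component, and not equal overall" for happened-before. All function names are chosen for the example.

```python
from typing import Dict

VectorClock = Dict[str, int]


def increment(clock: VectorClock, process: str) -> VectorClock:
    """Rule 1: a process bumps its own counter on each event."""
    updated = dict(clock)
    updated[process] = updated.get(process, 0) + 1
    return updated


def receive(local: VectorClock, received: VectorClock, process: str) -> VectorClock:
    """Rule 3: component-wise max, then count the receive as a local event."""
    merged = {p: max(local.get(p, 0), received.get(p, 0)) for p in set(local) | set(received)}
    return increment(merged, process)


def _leq(a: VectorClock, b: VectorClock) -> bool:
    return all(a.get(p, 0) <= b.get(p, 0) for p in set(a) | set(b))


def happened_before(a: VectorClock, b: VectorClock) -> bool:
    return _leq(a, b) and not _leq(b, a)


def concurrent(a: VectorClock, b: VectorClock) -> bool:
    return not _leq(a, b) and not _leq(b, a)


write_1 = increment({}, "P1")            # P1 updates a key
write_2 = increment({}, "P2")            # P2 updates the same key independently
merged = receive(write_2, write_1, "P2")  # P2 later sees P1's update
print(concurrent(write_1, write_2), happened_before(write_1, merged))
```

`concurrent(write_1, write_2)` being true is the signal a store like Dynamo or Riak uses to keep both versions and ask for conflict resolution.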

Use Cases

Lamport clocks: Total ordering of events (distributed snapshots)
Vector clocks: Conflict detection (Riak, Dynamo)
- Example: Detect concurrent updates to same key

TrueTime (Google Spanner)

Uses atomic clocks + GPS for global time
Time is interval (t ± ε) accounting for uncertainty
Wait out uncertainty before committing (so commit timestamps respect real-time order, i.e., external consistency)

10. Additional Important Topics

Back-of-the-Envelope Calculations

Common Numbers (Latency)

L1 cache: 0.5 ns
L2 cache: 7 ns
RAM: 100 ns
SSD read: 150 μs
HDD seek: 10 ms
Network within datacenter: 0.5 ms
Round trip CA to Netherlands: 150 ms

Storage Capacity

1 KB = 1,000 bytes
1 MB = 1,000 KB
1 GB = 1,000 MB
1 TB = 1,000 GB
1 PB = 1,000 TB

Traffic Estimates

Example: 100M DAU, average 10 requests/day
- QPS = 100M × 10 / 86400 ≈ 11,574 req/s
- Peak QPS ≈ 2-3× average ≈ 30,000 req/s
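
The traffic estimate can be reproduced in a few lines, assuming requests are spread evenly over the day and peak load is 2 to 3 times the average. `average_qps` is a helper named for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_qps(daily_active_users: int, requests_per_user_per_day: float) -> float:
    return daily_active_users * requests_per_user_per_day / SECONDS_PER_DAY


avg = average_qps(100_000_000, 10)
peak_low, peak_high = 2 * avg, 3 * avg
print(f"average ~ {avg:,.0f} req/s, peak ~ {peak_low:,.0f} to {peak_high:,.0f} req/s")
```

In an interview it is usually enough to round 86,400 to 100,000 and say "about 10k req/s average, about 30k at peak".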

Storage Estimates

Example: 1M tweets/day, 280 chars average, 5 years retention
- 280 bytes × 1M × 365 × 5 ≈ 500 GB
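
The storage estimate follows the same pattern, assuming 1 byte per character, no metadata or replication overhead, and decimal units (1 GB = 10^9 bytes). `storage_bytes` is a helper named for this example.

```python
def storage_bytes(items_per_day: int, bytes_per_item: int, retention_days: int) -> int:
    return items_per_day * bytes_per_item * retention_days


total = storage_bytes(1_000_000, 280, 365 * 5)
print(f"{total / 1e9:.0f} GB")
```

The raw figure is about 511 GB, which rounds to the 500 GB above; replication and indexes would typically multiply it several times.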

Polling vs Push vs Long Polling

Polling

Client periodically requests updates
Pros: Simple, stateless
Cons: Wasted requests (if no updates), delayed updates

Push (WebSockets, SSE)

Server pushes updates when available
Pros: Real-time, efficient
Cons: Complex, stateful connections

Long Polling

Client requests, server holds until update available (or timeout)
Pros: More real-time than polling, better compatibility than WebSockets
Cons: Still overhead of reconnections

Database Connection Pooling

Problem: Creating DB connections is expensive (TCP handshake, auth)
Solution: Pool of reusable connections
Benefits: Faster response, controlled max connections
Configuration: Min/max pool size, connection timeout, idle timeout
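
A minimal pool sketch built on a thread-safe queue, assuming connections are created eagerly and never go stale. It models only the max pool size and the checkout timeout; min size and idle timeout are left out. `ConnectionPool` and the `connect` factory are hypothetical.

```python
import queue
from contextlib import contextmanager
from typing import Callable, Iterator


class ConnectionPool:
    def __init__(self, connect: Callable[[], object], max_size: int = 5, timeout: float = 2.0) -> None:
        self._timeout = timeout
        self._idle: "queue.Queue[object]" = queue.Queue(maxsize=max_size)
        for _ in range(max_size):
            self._idle.put(connect())  # pay the connection cost once, up front

    @contextmanager
    def connection(self) -> Iterator[object]:
        # Blocks until a connection is free; raises queue.Empty after the timeout.
        conn = self._idle.get(timeout=self._timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always hand the connection back


created = []


def fake_connect() -> object:
    conn = object()  # stand-in for a real database connection
    created.append(conn)
    return conn


pool = ConnectionPool(fake_connect, max_size=2, timeout=0.05)
with pool.connection() as first:
    with pool.connection() as second:
        print(first is not second, len(created))
```

The bounded queue is what enforces "controlled max connections": a third concurrent caller waits (or times out) instead of opening yet another connection to the database.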

Partitioning vs Sharding

Often used interchangeably
Partitioning: Splitting data (can be on same server)
- Horizontal: Split rows (same schema)
- Vertical: Split columns (different tables)
Sharding: Horizontal partitioning across multiple servers

Webhooks

Concept: Server calls client URL when event occurs
Use cases: Payment notifications, GitHub push events
Considerations:
- Retry logic (client might be down)
- Idempotency (duplicates possible)
- Security (validate sender, HTTPS)
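
Two of the considerations above, sender validation and idempotency, can be sketched on the receiving side. This assumes the sender signs the raw body with a shared secret using HMAC-SHA256 and includes a unique event ID; real providers differ in header names and signature formats, so `WebhookReceiver` and `sign` are purely illustrative.

```python
import hashlib
import hmac
from typing import Set


def sign(secret: bytes, body: bytes) -> str:
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


class WebhookReceiver:
    def __init__(self, secret: bytes) -> None:
        self._secret = secret
        self._seen: Set[str] = set()  # processed event IDs (a DB table in practice)

    def handle(self, event_id: str, body: bytes, signature: str) -> str:
        expected = sign(self._secret, body)
        # Constant-time comparison avoids leaking how many characters matched.
        if not hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8")):
            return "rejected"
        if event_id in self._seen:
            return "duplicate"  # a retry of something already processed
        self._seen.add(event_id)
        return "processed"


secret = b"shared-secret"
body = b'{"type": "payment.succeeded"}'
receiver = WebhookReceiver(secret)
print(receiver.handle("evt_1", body, sign(secret, body)))
print(receiver.handle("evt_1", body, sign(secret, body)))
```

Returning a success status for duplicates tells the sender to stop retrying without running the side effect twice.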

Reverse Hash Lookup (Distributed Hash Table)

Use case: P2P systems (BitTorrent, blockchain)
Concept: Hash key maps to node responsible for storing it
Consistent hashing: Add/remove nodes with minimal reshuffling
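
A sketch of a consistent-hashing ring with virtual nodes, assuming SHA-256 as the hash and 100 virtual nodes per physical node (both arbitrary choices for illustration). It ignores the astronomically unlikely case of two virtual nodes hashing to the same position.

```python
import bisect
import hashlib
from typing import Dict, List


def _hash(key: str) -> int:
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, vnodes: int = 100) -> None:
        self._vnodes = vnodes
        self._ring: List[int] = []        # sorted virtual-node positions
        self._owner: Dict[int, str] = {}  # position -> physical node

    def add_node(self, node: str) -> None:
        for i in range(self._vnodes):
            pos = _hash(f"{node}#{i}")
            bisect.insort(self._ring, pos)
            self._owner[pos] = node

    def remove_node(self, node: str) -> None:
        self._ring = [p for p in self._ring if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def get_node(self, key: str) -> str:
        if not self._ring:
            raise LookupError("ring is empty")
        # First virtual node clockwise from the key, wrapping around at the end.
        idx = bisect.bisect_right(self._ring, _hash(key)) % len(self._ring)
        return self._owner[self._ring[idx]]


ring = HashRing()
for name in ("node-a", "node-b", "node-c"):
    ring.add_node(name)
keys = [f"key{i}" for i in range(1000)]
before = {k: ring.get_node(k) for k in keys}
ring.add_node("node-d")
moved = sum(1 for k in keys if ring.get_node(k) != before[k])
print(f"{moved} of {len(keys)} keys moved after adding a node")
```

Only keys that now fall just before one of the new node's positions change owner; with naive `hash(key) % N` placement, most keys would move when N changes.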

Interview Preparation Tips

How to Approach System Design Interviews

1. Clarify Requirements (5 min)

Functional: What features? (read/write, search, notifications)
Non-functional: Scale (users, requests/sec), latency, availability
Constraints: Budget, timeline, existing infrastructure

2. Back-of-Envelope Estimates (5 min)

Calculate QPS, storage, bandwidth
Determine scale tier (thousands vs millions vs billions)

3. High-Level Design (10-15 min)

Draw main components (client, load balancer, servers, databases, cache)
API design (key endpoints, request/response)
Data model (tables, relationships)

4. Deep Dive (15-20 min)

Interviewer will probe specific areas
Be ready to discuss: scaling, failures, bottlenecks, trade-offs
Common deep dives: Database choice, caching strategy, consistency model

5. Wrap Up (5 min)

Monitoring, metrics, alerts
Potential improvements, future scaling

Common Mistakes to Avoid

Jumping to solution without clarifying requirements
Over-engineering (don't add Kafka if simple queue suffices)
Ignoring trade-offs (every decision has pros/cons)
Not considering failures (what if DB goes down?)
Forgetting about monitoring/observability

Key Trade-offs to Discuss

Consistency vs Availability (CAP theorem)
Latency vs Throughput (batch processing vs real-time)
Normalization vs Denormalization (storage vs query speed)
SQL vs NoSQL (ACID vs scalability)
Monolith vs Microservices (simplicity vs scalability)
Synchronous vs Asynchronous (simplicity vs performance)

Practice Questions

Design Twitter/Instagram
Design URL shortener
Design video streaming (YouTube/Netflix)
Design messaging system (WhatsApp)
Design ride-sharing (Uber)
Design newsfeed
Design web crawler
Design search autocomplete
Design rate limiter
Design distributed cache
Design key-value store
Design notification system

Quick Reference

When to Use SQL vs NoSQL

Use SQL when:

Need ACID transactions
Complex queries with joins
Structured, relational data
Data integrity critical

Use NoSQL when:

Massive scale (horizontal scaling)
Flexible schema
High write throughput
Eventual consistency acceptable

Caching Decision Tree

Frequently accessed data? → Yes → Cache it
Read-heavy or write-heavy?
- Read-heavy → Cache-aside
- Write-heavy → Write-through or write-behind
Consistency critical?
- Yes → Write-through
- No → Cache-aside with TTL

Database Replication Strategy

Read-heavy → Master-slave
Write-heavy + global → Master-master
Strong consistency → Master-slave with sync replication
High availability → Master-master or multi-region

Message Queue vs Database

Use Queue when: Async processing, decoupling, load leveling
Use Database when: Need to query data, ACID required, persistent storage

This guide covers the essential system design topics for interviews. Remember: there's rarely one "correct" answer in system design. Focus on demonstrating your thought process, understanding trade-offs, and designing for the stated requirements.
